GitHub link: https://github.com/TerryTian21/JSC370-Final-Project

Introduction

Abstract

  A report from FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis) on labour market conditions highlighted a drastic change in software engineer job postings over the past five years. Indexed to Feb 1, 2020 = 100, the number of postings rose sharply, peaking in early 2022 (Index = 240). Just as rapidly, postings then fell to a low in late 2023. With numerous tech unicorns announcing layoffs, the tech bubble appears to have burst. This paper evaluates the software/data engineering market in 2021 and 2023, showing the differences in available roles, postings by location, and employee skill-set requirements.

Figure 1: Line chart of Indeed job postings, indexed to a baseline of Feb 1, 2020. The chart is seasonally adjusted based on historic patterns in 2017-2019. Each series, including the national trend, occupational sectors, and sub-national geographies, is seasonally adjusted separately.

Hypothesis

The two main questions of interest are as follows:

  1. What are the differences between 2021 and 2023 postings?
  2. Can we use posting-metadata to predict salaries?

The goal of this paper is to clarify why so many engineers have been struggling to find employment opportunities in North America. Three datasets are used to answer the above questions: two Kaggle datasets and one dataset found on GitHub.

Methods

Data Summary

  The first Kaggle dataset, titled Software Engineering Jobs Dataset, was compiled by Yazeed Fares. It contains 9380 observations and 8 features and was collected by scraping LinkedIn Jobs. Although the scraping was performed on Dec. 25, 2023, not all jobs were posted on that date, since LinkedIn retains job postings for up to 6 months. For the purpose of this exploration, we treat this as a reasonable sample of job postings in 2023.

  The second dataset, uploaded by Arsh Koneru and Zoey Yu Zou, is a comprehensive aggregation of LinkedIn job postings in 2023/2024. It consists of 11 .csv files initially stored as tables in a SQL database. However, we are only interested in posting metadata, so information on companies, industries, and benefits is disregarded. The primary purpose of this dataset is to supplement salary information for software engineering job postings (not contained in dataset 3).

  The third dataset was developed by Mike Lawrence, a Machine Learning Engineer at Google. It contains 8261 observations and 13 features and was likewise collected from scraped LinkedIn postings, in October 2021.

  Since the datasets differ in dimensions, the first step is to subset each dataset to a matching set of features for the purpose of comparison. The variables of interest are listed below. After subsetting the data, all observations containing NA values are removed.

Figure 2: Summary and Description of Variables of Interest for 2021 and 2023 Datasets
Variable     Type       Description
Company      character  Name of company
Description  character  Description of job including but not limited to company overview, requirements, skillset
Title        character  Name of position
Location     character  Location of job
Seniority    character  Classification of role based on experience, technical expertise, leadership responsibilities
Year         factor     Year the job was posted
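
The subsetting step can be sketched as follows. This is a minimal Python illustration rather than the project's own code (the report appears to rely on R). The helper name harmonise, the column_map argument, the raw column names, and the sample rows are all hypothetical; only the shared variable names come from Figure 2.

```python
# Shared variables from Figure 2 (Year is attached per dataset below).
SHARED = ["Company", "Description", "Title", "Location", "Seniority"]

def harmonise(rows, column_map, year):
    """Keep only the shared variables, tag the year, drop incomplete rows.

    column_map maps each shared name to the raw column name used by one
    dataset; the raw names in the example below are hypothetical.
    """
    cleaned = []
    for row in rows:
        record = {name: row.get(column_map[name]) for name in SHARED}
        # Treat None and empty/whitespace-only strings as NA.
        if any(v is None or not str(v).strip() for v in record.values()):
            continue
        record["Year"] = year
        cleaned.append(record)
    return cleaned

# Hypothetical sample rows standing in for the 2021 dataset.
raw_2021 = [
    {"company": "Acme", "description": "Build APIs", "title": "Sr. Software Engineer",
     "location": "Austin, TX", "level": "Mid-Senior level", "extra": 1},
    {"company": "Acme", "description": None, "title": "Data Engineer",
     "location": "Remote", "level": "Entry level"},
]
map_2021 = {"Company": "company", "Description": "description", "Title": "title",
            "Location": "location", "Seniority": "level"}

combined = harmonise(raw_2021, map_2021, 2021)
```

Running the same helper on the 2023 rows with its own column map and concatenating the two lists would give one table with identical columns for both years.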

Data Wrangling

Titles

  Due to the structure of job titles, additional data wrangling is required to categorize them properly. For example, variation between postings can result in different titles representing the same type of position (e.g. Sr. Software Engineer vs. Senior Software Engineer). This would affect group_by() operations, producing many more categories than necessary. Thus, custom titles are defined based on keyword matches. Using the case_when() function, titles are classified from specific to generic, so a title like “Front-end Software Engineer” is classified as Front End Engineer rather than Software Engineer.

Figure 3: Regex used to create classification levels for posting titles.
Title                       Pattern
Back End Engineer           Back-end, backend
Cloud Engineer              Cloud
Data Engineer               Data
Data Scientist              Data Scientist
DevOps Engineer             Devops
Embedded Systems Engineer   Embedded, System
Front End Engineer          Front-end, front end, frontend
Full Stack Engineer         Full stack, full-stack
Machine Learning Engineer   Machine Learning, AI, Artificial Intelligence
Mobile Software Engineer    Mobile, iOS, Android
Other                       .*
QA Engineer                 Test, Quality, QA
Research Engineer           Research, Scientist
Security Engineer           Security, Cyber
Site Reliability Engineer   Site Reliability, site-reliability
Software Engineer           Software
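
The specific-to-generic matching can be illustrated with a short Python sketch that mimics how case_when() takes the first condition that holds. The patterns are taken from Figure 3, but the rule ordering, the word boundary around "AI", and the function name classify_title are assumptions made for illustration; the report does not state the exact order used.

```python
import re

# Ordered from specific to generic; the first matching rule wins, as with
# case_when(). The ordering and the \b around "ai" are assumptions.
TITLE_RULES = [
    ("Data Scientist", r"data scientist"),
    ("Machine Learning Engineer", r"machine learning|\bai\b|artificial intelligence"),
    ("Research Engineer", r"research|scientist"),
    ("Site Reliability Engineer", r"site reliability|site-reliability"),
    ("Security Engineer", r"security|cyber"),
    ("DevOps Engineer", r"devops"),
    ("Cloud Engineer", r"cloud"),
    ("Embedded Systems Engineer", r"embedded|system"),
    ("Mobile Software Engineer", r"mobile|ios|android"),
    ("QA Engineer", r"test|quality|qa"),
    ("Full Stack Engineer", r"full stack|full-stack"),
    ("Front End Engineer", r"front-end|front end|frontend"),
    ("Back End Engineer", r"back-end|backend"),
    ("Data Engineer", r"data"),
    ("Software Engineer", r"software"),
]

def classify_title(title):
    """Return the first matching category, or "Other" (the .* catch-all)."""
    for label, pattern in TITLE_RULES:
        if re.search(pattern, title, re.IGNORECASE):
            return label
    return "Other"

examples = ["Sr. Software Engineer", "Senior Software Engineer",
            "Front-end Software Engineer", "Senior Data Scientist"]
labels = [classify_title(t) for t in examples]
```

Because generic keywords such as "Data" and "Software" sit at the bottom of the list, a title only falls through to them when no narrower category applies.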

Seniority

  Analogous to titles, seniority levels are inconsistent between the two datasets. The 2021 dataset has 8 levels of seniority while the 2023 dataset contains only 2 classifications. To keep the classifications consistent across both datasets, custom seniority levels are defined based on keywords in the title (e.g. Staff ~ Staff Level).

Figure 4: Regex used to create classification levels for posting Seniority Level
Seniority        Pattern
Principal        Principal
Staff            Staff
Lead             Lead
Senior           Sr., Sr, Senior, III
Founding         Founding
Manager          Manager
Junior           Entry Level, Junior, Entry-Level, Graduate, Jr., II, Jr, I
Junior           Entry level, Associate
Senior           Mid-Senior level
None Specified   .*
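
Short patterns such as "I", "II" and "III" are easy to mis-match, so the sketch below shows one way the Figure 4 rules could be applied in Python. Everything beyond the keyword lists is an assumption: the word boundaries, the function name classify_seniority, and the idea that the last two keyword rows ("Entry level, Associate" and "Mid-Senior level") are checked against the dataset's own seniority field only when the title gives no match.

```python
import re

# Title rules in the order listed in Figure 4; the first match wins.
# \b word boundaries stop "I" from matching inside "III" or "Engineer".
TITLE_SENIORITY = [
    ("Principal", r"\bprincipal\b"),
    ("Staff", r"\bstaff\b"),
    ("Lead", r"\blead\b"),
    ("Senior", r"\bsr\b\.?|\bsenior\b|\bIII\b"),
    ("Founding", r"\bfounding\b"),
    ("Manager", r"\bmanager\b"),
    ("Junior", r"\bentry[ -]level\b|\bjunior\b|\bgraduate\b|\bjr\b\.?|\bII\b|\bI\b"),
]

# Assumed fallback: applied to the provided seniority field, not the title.
LEVEL_FALLBACK = [
    ("Junior", r"entry level|associate"),
    ("Senior", r"mid-senior level"),
]

def classify_seniority(title, level=""):
    """Classify from the title first, then from the provided level field."""
    for label, pattern in TITLE_SENIORITY:
        if re.search(pattern, title, re.IGNORECASE):
            return label
    for label, pattern in LEVEL_FALLBACK:
        if re.search(pattern, level or "", re.IGNORECASE):
            return label
    return "None Specified"

levels = [classify_seniority(t) for t in
          ["Software Engineer III", "Software Engineer II", "Staff Software Engineer"]]
```

Without the word boundaries, a bare "I" pattern would match almost every title containing the letter, which is why the boundary handling matters more here than for the title categories.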

GeoData

  Additional wrangling was required to plot Postings Count ~ Location. The provided location data only contains the posting location formatted as City, State. However, graphing libraries (ggmap) require latitude and longitude values to plot location data. After some experimentation, the most effective visualization grouped postings at the State level. The first step was to use regex to extract the state abbreviation from each location string. For the locations that contained State information, the Google Geocoding API was then used to obtain coordinates for each state. Because some postings were located outside the United States and some location strings could not be parsed, it was not possible to extract a State value for every location. 7258 postings from 2021 and 6948 from 2023 remained available for plotting.
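
The state-extraction step might look like the following Python sketch. The regex, the helper names extract_state and count_by_state, and the sample locations are assumptions for illustration; the geocoding call is left out because it needs an API key and network access.

```python
import re
from collections import Counter

# Two-letter codes for the 50 states plus DC.
STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)

def extract_state(location):
    """Return the first two-letter code after a comma that is a US state."""
    for match in re.finditer(r",\s*([A-Z]{2})\b", location):
        if match.group(1) in STATES:
            return match.group(1)
    return None  # non-US or unparsable location

def count_by_state(locations):
    """Count postings per state, skipping locations without a state."""
    states = (extract_state(loc) for loc in locations)
    return Counter(s for s in states if s is not None)

# Hypothetical location strings in the "City, State" format.
sample = ["San Francisco, CA", "San Jose, CA", "Austin, TX", "Toronto, ON", "Remote"]
counts = count_by_state(sample)
```

Locations that return None correspond to the postings that had to be dropped before plotting; the per-state counts are what would then be joined to one pair of coordinates per state.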

Preliminary

  The following figures represent EDA on our variables of interest. All variables are either factors or free text, hence visualizations are limited to bar charts listing the top-n counts grouped by each feature.
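
The top-n counts behind such bar charts can be computed along these lines. This is a Python sketch under assumed names (top_n_by_year and the sample records are hypothetical); it only shows the counting, not the plotting.

```python
from collections import Counter, defaultdict

def top_n_by_year(records, feature, n=10):
    """Return the n most frequent values of `feature` for each year."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["Year"]][record[feature]] += 1
    return {year: counter.most_common(n) for year, counter in counts.items()}

# Hypothetical cleaned records with the variables from Figure 2.
records = [
    {"Year": 2021, "Title": "Software Engineer"},
    {"Year": 2021, "Title": "Software Engineer"},
    {"Year": 2021, "Title": "Data Scientist"},
    {"Year": 2023, "Title": "Data Engineer"},
]
top_titles = top_n_by_year(records, "Title", n=1)
```

The same helper applies to Company, Seniority, or any other categorical feature by changing the feature argument.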

  Figure 5 compares the number of postings in 2021 to the number in 2023. Note that this list is not exhaustive of all postings in 2021/2023 and shouldn’t be taken as evidence contradicting the hypothesis. It reflects the number of postings available on LinkedIn at the time of collection. It is quite possible that the scraper missed postings, or that volumes were lower or higher at the point in time when the scraper aggregated the dataset.

Figure 5: Comparison of number of postings in 2021 and 2023.

  ‘Software Engineer’ was the most popular role in both years, but the key difference is a lack of specificity in 2023 roles. The 2021 dataset has over 1000 occurrences of Machine Learning Engineer (MLE), Site Reliability Engineer and Data Scientist postings, while the second most frequent title in the 2023 dataset is Embedded Systems Engineer with 700 postings. Once we account for the 2023 scrape not targeting Data Science related roles, the difference between the two years appears more reasonable.

Figure 6: Comparison of top 10 posting titles in 2021 vs 2023.

  One noticeable difference, however, is the desired seniority level. In 2021, entry-level/junior engineer roles were the most frequent seniority, with 2687 postings. The 2023 dataset, by contrast, saw a large shift towards senior roles: the vast majority of postings were for Senior Engineer (4183), and postings for Staff, Principal and Lead Engineer roles also increased.

Figure 7: Comparison of seniority counts for postings in 2021 vs 2023

  Although it may be attributable to the timing of data collection, companies posting in 2021 are more traditional “big tech” while postings from 2023 are more scattered. In 2021, Apple posted 600 openings, followed by Microsoft, Uber, and Salesforce with roughly 100 postings each. 2023 saw a different mix of employers, with a lack of postings from popular SaaS companies and tech giants. In fact, the largest number of postings in 2023 comes from Jobs for Humanity - a platform for “Connecting historically under represented talent to welcoming employers across the globe”. This is not anomalous. According to Layoffs.fyi, 1,036 tech companies laid off a total of 238,397 employees in the first nine months of 2023. Therefore, we would expect to see more postings come from niche sectors like U.S. defence (Northrop Grumman) and recruiting agencies (Recruiting from Scratch & IP Recruiter Group).

Figure 8: Comparison of company counts in 2021 vs 2023.

  Figure 9 shows the geographic locations of job postings. Each bubble indicates the number of postings in the given state, relative to the other bubbles on the map. There isn’t a notable difference between the two years, with both seeing the largest number of postings in California, Texas and New York - the “tech hubs” of the U.S.

Figure 9: Maps of The United States showing relative postings counts by state.

Summary

Findings

  From an initial breakdown of the datasets and variables, we found that postings in 2021 and 2023 differ in their associated metadata. Location counts were the only consistent factor between the two years, while the Titles, Seniority and Companies data all support the argument that job-seekers faced increased difficulty in 2023. Many popular employers didn’t post opportunities for new grads / entry-level candidates and instead sought to fill more senior or leadership positions.

Future Steps

  The next step of the project will tackle NLP and prediction models. Job skill sets, years of experience, and sentiment can be extracted and compared for the two years. An additional dataset (dataset 2) containing salaries of SWE jobs in 2023 will be introduced to compare wages. A numeric feature allows for further exploration of variable relationships such as Title~Salary, Location~Salary, and Company~Salary. Finally, multiple linear regression (MLR), generalized linear mixed model (GLMM), and boosting models will be trained on the new data to answer question 2 of our hypothesis - salary prediction.